In [ ]:
from IPython.display import Image
Python's built-in containers (list, tuple, dict) are flexible but slow for numerical work; NumPy arrays trade that flexibility for speed and memory efficiency.
In [ ]:
import numpy as np
In [ ]:
lst = list(range(1000))
arr = np.arange(1000)
Here's what the array looks like:
In [ ]:
arr[:10]
In [ ]:
arr[10:20]
In [ ]:
arr[10:20:2]
In [ ]:
type(arr)
In [ ]:
%timeit [i ** 2 for i in lst]
In [ ]:
%timeit arr ** 2
We can index arrays in the same ways as lists
In [ ]:
arr[5:10]
In [ ]:
arr[-1]
In [ ]:
['a', 2, (1, 3)]
In [ ]:
lst[0] = 'some other type'
In [ ]:
lst[:3]
In [ ]:
arr[0] = 'some other type'
Unlike lists, arrays are homogeneous: every element must have the same type, which is stored in the array's dtype attribute:
In [ ]:
arr.dtype
In [ ]:
arr[0] = 1.234
In [ ]:
arr[:10]
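Note that assigning a float into an integer array silently truncates the fractional part. To actually change the element type you need the astype method, which returns a new array. A minimal sketch, assuming an integer array like the one above:

```python
import numpy as np

arr = np.arange(10)        # integer dtype
arr[0] = 1.234             # silently truncated to 1
farr = arr.astype(float)   # explicit conversion returns a new float array
farr[0] = 1.234            # now the fractional part is kept
```

The original array is unchanged by astype; the conversion always copies.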
In [ ]:
Image("https://docs.scipy.org/doc/numpy/_images/threefundamental.png")
In [ ]:
np.zeros(5, dtype=float)
In [ ]:
np.zeros(5, dtype=int)
In [ ]:
np.zeros(5, dtype=complex)
In [ ]:
np.ones(5, dtype=float)
The arange function generates an array for a range of integers. NumPy also provides the linspace and logspace functions to create linearly- and logarithmically-spaced grids, respectively, with a fixed number of points that includes both ends of the specified interval:
In [ ]:
np.linspace(0, 1, num=5)
In [ ]:
np.logspace(1, 4, num=4)
Finally, it is often useful to create arrays with random numbers that follow a specific distribution. The np.random
module contains a number of functions that can be used to this effect, for example this will produce an array of 5 random samples taken from a standard normal distribution (0 mean and variance 1) $ X \sim N(0, 1) $:
In [ ]:
np.random.randn(5)
We can also sample from a normal distribution with a different mean and standard deviation, for example $X \sim N(9, 3)$:
In [ ]:
norm10 = np.random.normal(loc=9, scale=3, size=10)
In [ ]:
%load solutions/random_number.py
Consider for example that in the array norm10 we want to replace all values above 9 with the value 0. We can do so by first finding the mask that indicates where this condition is True or False:
In [ ]:
mask = norm10 > 9
mask
In [ ]:
norm10[mask]
In [ ]:
norm10[[1, 4, 6]]
In [ ]:
norm10[norm10 > 9] = 0
In [ ]:
norm10
In [ ]:
norm10[[1, 4, 7]] = 10
In [ ]:
norm10
In [ ]:
x = np.arange(10)
In [ ]:
x
In [ ]:
y = x[::2]
y
In [ ]:
y[3] = 100
y
In [ ]:
x
In [ ]:
a = norm10[[0, 1, 5]]
In [ ]:
a
In [ ]:
a[:] = -10
In [ ]:
a
In [ ]:
norm10
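The distinction between views and copies can be checked directly with np.shares_memory: a basic slice is a view into the same buffer, while fancy (list) indexing always copies. A small sketch:

```python
import numpy as np

x = np.arange(10)
view = x[::2]          # basic slicing -> view of x's buffer
copy = x[[0, 2, 4]]    # fancy indexing -> independent copy

print(np.shares_memory(x, view))   # True
print(np.shares_memory(x, copy))   # False
```

This check is a convenient way to predict whether an assignment will propagate back to the original array.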
Create the array [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] without typing the values by hand. Assign 100 to elements 2 to 5 (zero-indexed). Print the array.
Create the same array as in step one. Create a new array from a slice of elements 2 to 5, then assign 100 to the slice (hint: try [:] to address all of the elements of an array). Print both the original array and the slice.
In [ ]:
# [Solution here]
In [ ]:
%load solutions/copies_vs_views.py
In [ ]:
samples_list = [[632, 1638, 569, 115], [433,1130,754,555]]
samples_array = np.array(samples_list)
samples_array.shape
In [ ]:
print(samples_array)
With two-dimensional arrays we start seeing the convenience of NumPy data structures: while a nested list can be indexed across dimensions using consecutive [ ]
operators, multidimensional arrays support a more natural indexing syntax with a single set of brackets and a set of comma-separated indices:
In [ ]:
samples_list[0][1]
In [ ]:
samples_array[0,1]
Most of the array creation functions listed above can be passed multidimensional shapes. For example:
In [ ]:
np.zeros((2,3))
In [ ]:
np.random.normal(10, 3, size=(2, 4))
In fact, an array can be reshaped at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is via the array's reshape
method.
In [ ]:
arr = np.arange(8).reshape(2,4)
arr
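One dimension passed to reshape may be left as -1, in which case NumPy infers it from the total number of elements. For example:

```python
import numpy as np

arr = np.arange(8).reshape(2, -1)   # second dimension inferred as 4
print(arr.shape)                    # (2, 4)
```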
With multidimensional arrays, you can also use slices, and you can mix and match slices and single indices in the different dimensions (using the same array as above):
In [ ]:
arr[1, 2:4]
In [ ]:
arr[:, 2]
If you only provide one index, then you will get the corresponding row.
In [ ]:
arr[1]
Now that we have seen how to create arrays with more than one dimension, it's a good idea to look at some of the most useful properties and methods that arrays have. The following provide basic information about the size, shape and data in the array:
In [ ]:
print('Data type :', samples_array.dtype)
print('Total number of elements :', samples_array.size)
print('Number of dimensions :', samples_array.ndim)
print('Shape (dimensionality) :', samples_array.shape)
print('Memory used (in bytes) :', samples_array.nbytes)
Arrays also have many useful methods; some especially useful ones are:
In [ ]:
print('Minimum and maximum :', samples_array.min(), samples_array.max())
print('Sum, mean and standard deviation:', samples_array.sum(), samples_array.mean(), samples_array.std())
For these methods, the above operations are all computed over all the elements of the array. But for a multidimensional array, it's possible to carry out the computation along a single dimension by passing the axis parameter; for example:
In [ ]:
samples_array.sum(axis=0)
In [ ]:
samples_array.sum(axis=1)
The keepdims keyword retains the reduced axis as a dimension of size one, so the result still broadcasts against the original array:
In [ ]:
samples_array.sum(axis=1, keepdims=True)
Another widely used property of arrays is the .T
attribute, which allows you to access the transpose of the array:
In [ ]:
samples_array.T
Arrays have a wide variety of other methods and properties:
In [ ]:
[attr for attr in dir(samples_array) if not attr.startswith('__')]
Internally, an array is just a block of memory plus some metadata: using the shape and strides attributes, NumPy can interpret bytes laid out linearly in memory as a multidimensional object.
In [ ]:
Image('https://ipython-books.github.io/images/layout.png')
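We can inspect this machinery directly. For a C-ordered 2x4 array of 8-byte integers, stepping to the next row means jumping 32 bytes and stepping to the next column means jumping 8; transposing just swaps shape and strides without copying any data. A sketch (the stride values assume the explicit int64 dtype):

```python
import numpy as np

a = np.arange(8, dtype=np.int64).reshape(2, 4)
print(a.shape, a.strides)        # (2, 4) (32, 8)
print(a.T.shape, a.T.strides)    # (4, 2) (8, 32)
print(np.shares_memory(a, a.T))  # True: the transpose is a view
```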
In [ ]:
%load solutions/matrix_creation.py
In [ ]:
sample1 = np.array([632, 1638, 569, 115])
sample2 = np.array([433,1130,754,555])
sample_sum = sample1 + sample2
In [ ]:
np.array([632, 1638, 569, 115])
This includes the multiplication operator: unlike in Matlab, for example, * does not perform matrix multiplication but operates element-wise:
In [ ]:
print('{0} X {1} = {2}'.format(sample1, sample2, sample1 * sample2))
Since Python 3.5, you can use the @ operator to compute the inner product (or matrix multiplication):
In [ ]:
print('{0} . {1} = {2}'.format(sample1, sample2, sample1 @ sample2))
In [ ]:
sample1 + 1.5
In this case, NumPy looked at both operands and saw that the first was a one-dimensional array of length 4 and the second was a scalar, considered a zero-dimensional object. The broadcasting rules allow NumPy to create new dimensions of length 1 on the left of a shape, and to stretch any dimension of length 1 to match the corresponding dimension of the other operand.
So in the above example, the scalar 1.5 is effectively cast to a 1-dimensional array of length 1, then stretched to length 4 to match the dimension of arr1. After this, element-wise addition can proceed as now both operands are one-dimensional arrays of length 4.
This broadcasting behavior is powerful, especially because when NumPy broadcasts to create new dimensions or to stretch existing ones, it doesn't actually replicate the data. In the example above the operation is carried as if the 1.5 was a 1-d array with 1.5 in all of its entries, but no actual array was ever created. This saves memory and improves the performance of operations.
When broadcasting, NumPy compares the sizes of each dimension in each operand. It starts with the trailing dimensions and works forward, creating dimensions as needed to accommodate the operation. Two dimensions are considered compatible when they are equal, or when one of them is 1.
If these conditions are not met, an exception is thrown, indicating that the arrays have incompatible shapes.
In [ ]:
sample1 + np.array([7,8])
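The same compatibility check can be performed without building any arrays via np.broadcast_shapes (available in NumPy 1.20 and later); incompatible shapes raise a ValueError just like the failed addition above:

```python
import numpy as np

print(np.broadcast_shapes((4,), ()))      # scalar vs length-4 -> (4,)
print(np.broadcast_shapes((2, 1), (4,)))  # the length-1 dim stretches -> (2, 4)

try:
    np.broadcast_shapes((4,), (2,))       # trailing dims 4 vs 2: incompatible
except ValueError as e:
    print('incompatible:', e)
```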
In [ ]:
b = np.array([10, 20, 30, 40])
bcast_sum = sample1 + b
In [ ]:
print('{0}\n\n+ {1}\n{2}\n{3}'.format(sample1, b, '-'*21, bcast_sum))
In [ ]:
c = np.array([-100, 100])
sample1 + c
Remember that matching begins at the trailing dimensions. Here, c would need a trailing dimension of 1 for the broadcasting to work. We can augment arrays with dimensions on the fly by indexing with the np.newaxis object, which adds an "empty" dimension:
In [ ]:
cplus = c[:, np.newaxis]
cplus
In [ ]:
cplus.shape
In [ ]:
sample1 + cplus
In [ ]:
sample1[:, np.newaxis] + c
Divide each column of the array
a = np.arange(25).reshape(5, 5)
elementwise by the array
b = np.array([1., 5, 10, 15, 20])
In [ ]:
# [Solution here]
In [ ]:
%load solutions/broadcasting.py
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.
pandas is well suited for tabular data with heterogeneously-typed columns, for ordered and unordered time series, and for arbitrary matrix data with row and column labels.
Virtually any statistical dataset, labeled or unlabeled, can be converted to a pandas data structure for cleaning, transformation, and analysis.
In [ ]:
import pandas as pd
A Series can be thought of as an ordered key-value store.
In [ ]:
counts = pd.Series([632, 1638, 569, 115])
counts
If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the Series
, while the index is a pandas Index
object.
In [ ]:
counts.values
Pandas provides a labeled index to access the rows
In [ ]:
counts.index
We can assign meaningful labels to the index, if they are available:
In [ ]:
bacteria = pd.Series([632, 1638, 569, 115],
index=['Firmicutes', 'Proteobacteria',
'Actinobacteria', 'Bacteroidetes'])
bacteria
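With labels in place, values can be looked up by name, and boolean masks work just as with NumPy arrays while preserving the labels. A sketch using the bacteria Series defined above:

```python
import pandas as pd

bacteria = pd.Series([632, 1638, 569, 115],
                     index=['Firmicutes', 'Proteobacteria',
                            'Actinobacteria', 'Bacteroidetes'])

print(bacteria['Proteobacteria'])   # label-based lookup -> 1638
print(bacteria[bacteria > 600])     # boolean mask keeps the labels
```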
NumPy's math functions and other operations can be applied to Series without losing the data structure.
In [ ]:
np.log(bacteria)
In [ ]:
bacteria_dict = {
'Firmicutes': 632,
'Proteobacteria': 1638,
'Actinobacteria': 569,
'Bacteroidetes': 115
}
pd.Series(bacteria_dict)
Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.
A DataFrame
is a tabular data structure, encapsulating multiple series like columns in a spreadsheet.
In [ ]:
data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
'patient': [1, 1, 1, 1, 2, 2, 2, 2],
'phylum': ['Firmicutes', 'Proteobacteria', 'Actinobacteria',
'Bacteroidetes', 'Firmicutes', 'Proteobacteria',
'Actinobacteria', 'Bacteroidetes']})
data
It's often useful to inspect just the first few rows of a DataFrame; use the head method to do this:
In [ ]:
data.head()
The first axis of a DataFrame also has an index that represents the labeled columns:
In [ ]:
data.columns
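Individual columns can be extracted by name, and each comes back as a Series that keeps the row index. A sketch, rebuilding a small version of the data frame above:

```python
import pandas as pd

data = pd.DataFrame({'value': [632, 1638, 569, 115],
                     'patient': [1, 1, 1, 1],
                     'phylum': ['Firmicutes', 'Proteobacteria',
                                'Actinobacteria', 'Bacteroidetes']})

col = data['value']         # single column -> Series
print(type(col).__name__)   # Series
print(col.sum())            # 2954
```

Series methods like sum, mean and std are then available per column.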
read_csv is a highly optimized CSV reader:
In [ ]:
vessels = pd.read_csv("../data/AIS/vessel_information.csv")
vessels.head()
As an exercise, use read_csv to load ../data/NationalFoodSurvey/NFS_1974.csv into a DataFrame.
In [ ]:
%load solutions/read_nfs_1974.py